Causal Inference II

Chelsea Parlett-Pelleriti

Review

Core DAG Paths

Fork: z is a confounder (condition)

Chain: z is a mediator (maybe condition)

Collider: z is a collider (don’t condition)
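As a quick sanity check on the collider rule, a small simulation (hypothetical variables, not from the slides) shows how conditioning on a collider manufactures an association between two independent causes:

```r
# sketch: x and y are independent, but both cause the collider z
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)

# marginally, x and y are (nearly) uncorrelated
cor(x, y)

# conditioning on z opens the path: x picks up a spurious (negative) effect
summary(lm(y ~ x + z))$coef["x", "Estimate"]
```

The same setup flipped around (z causing both x and y) gives the fork case, where conditioning on z is what *removes* the spurious association.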

Adjustment

Goal: close back door paths by controlling for an appropriate adjustment set.

  • How do we “control for”?

    • covariates

    • subsetting

    • matching

    • propensity scores

    • regression discontinuity

    • diff-in-diff

Covariates

set.seed(12938)
library(dplyr)  # for %>% and filter()
n <- 250
# note: despite the name, this is the inverse logit (logistic) function
logit <- function(x){
  return(1/(1 + exp(-x)))
}

extra_curriculars <- rnorm(n)
exam_scores <- rnorm(n)
acceptance <- rbinom(n, size = 1, prob = logit(1.6*extra_curriculars + exam_scores*2))

df <- data.frame(extra_curriculars, exam_scores, acceptance)

# build reg
lin_reg <- lm(exam_scores ~ extra_curriculars + acceptance,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -0.6392     0.0783 -8.1622    0e+00
extra_curriculars  -0.2280     0.0579 -3.9402    1e-04
acceptance          1.2703     0.1187 10.7038    0e+00

Coefficient Interpretation in GLMs

\[ \beta = \rho_{x,y} \cdot \frac{\sigma_y}{\sigma_x} \]

In simple linear regression, we can prove via OLS or MLE that the coefficient for variable \(x\) is a scaled version of the correlation \(\rho_{x,y}\)
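This identity is easy to verify numerically (simulated data; the variable names are illustrative):

```r
# sketch: check beta = cor(x, y) * sd(y) / sd(x) in simple regression
set.seed(42)
x <- rnorm(100)
y <- 3*x + rnorm(100)

beta_hat <- unname(coef(lm(y ~ x))["x"])
scaled_cor <- cor(x, y) * sd(y) / sd(x)

all.equal(beta_hat, scaled_cor)  # TRUE: identical up to floating point
```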

But what about when there are multiple predictors, \(\mathbf{x} = \left(x_1, x_2, \ldots, x_n \right)\)?

Coefficient Interpretation in GLMs

# simulate data
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n, 0 + 0.5*x1)
y <- rnorm(n, x1*2 + 0.4*x2 - 2)
df <- data.frame(y = y, x1 = x1, x2 = x2)

# regular model
lm(y ~ x1 + x2, data = df)

Call:
lm(formula = y ~ x1 + x2, data = df)

Coefficients:
(Intercept)           x1           x2  
    -1.9826       1.9881       0.4043  

Coefficient Interpretation in GLMs

# regular model
lm(y ~ x1 + x2, data = df)

Call:
lm(formula = y ~ x1 + x2, data = df)

Coefficients:
(Intercept)           x1           x2  
    -1.9826       1.9881       0.4043  
# naive approach
cor(x1,y) * sd(y)/sd(x1)
[1] 2.188284
cor(x2,y) * sd(y)/sd(x2)
[1] 1.204316

Coefficient Interpretation in GLMs

correlation: the correlation between \(y\) and \(x1\)

Coefficient Interpretation in GLMs

semi-partial correlation: the correlation between \(y\) and \(x1\), after subtracting the influence of \(x2\) on \(x1\)

Coefficient Interpretation in GLMs

# regular model
lm(y ~ x1 + x2, data = df)

Call:
lm(formula = y ~ x1 + x2, data = df)

Coefficients:
(Intercept)           x1           x2  
    -1.9826       1.9881       0.4043  
# naive approach
cor(x1,y) * sd(y)/sd(x1)
[1] 2.188284
cor(x2,y) * sd(y)/sd(x2)
[1] 1.204316
# spc approach
# calculate resid
resx2 <- lm(x2~x1,data = df)$residuals
resx1 <- lm(x1~x2,data = df)$residuals

cor(resx1,y) * sd(y)/sd(resx1)
[1] 1.988138
cor(resx2,y) * sd(y)/sd(resx2)
[1] 0.4043375

Coefficient Interpretation in GLMs

# simulate data
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n, 0 + 0.5*x1)
x3 <- rnorm(n, 0 + 0.5*x1 - 0.1*x2)
y <- rnorm(n, x1*2 + 0.4*x2 + 0.2*x3 - 2)
df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

# regular model
lm(y ~ x1 + x2 + x3, data = df)

Call:
lm(formula = y ~ x1 + x2 + x3, data = df)

Coefficients:
(Intercept)           x1           x2           x3  
    -2.0028       2.0050       0.4085       0.1900  
# naive approach
cor(x1,y) * sd(y)/sd(x1)
[1] 2.296495
cor(x2,y) * sd(y)/sd(x2)
[1] 1.236363
cor(x3,y) * sd(y)/sd(x3)
[1] 0.9591191
# spc approach
# calculate resid
resx3 <- lm(x3~x1 + x2,data = df)$residuals

cor(resx3,y) * sd(y)/sd(resx3)
[1] 0.1900424

A Note on “Controlling For”

💡 when we say we want to “control for” a variable, what we’d ideally like is to compare subjects who are exactly the same in that variable.

😦 But we often can’t get that data easily

⬇️ All the methods we’re discussing today allow us to get closer to that ideal when exact comparison isn’t possible. For each method, ask yourself: “how does this make the groups I’m comparing more similar?”

Subsetting

n <- 250
# note: despite the name, this is the inverse logit (logistic) function
logit <- function(x){
  return(1/(1 + exp(-x)))
}

extra_curriculars <- rnorm(n)
exam_scores <- rnorm(n)
acceptance <- rbinom(n, size = 1,
                     prob = logit(1.6*extra_curriculars + exam_scores*2))

df <- data.frame(extra_curriculars, exam_scores, acceptance)

# build reg
lin_reg <- lm(exam_scores ~ extra_curriculars,
    data = df %>% filter(acceptance == 1)) %>% summary()

lin_reg$coef %>% round(4)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         0.6840     0.0841  8.1291   0.0000
extra_curriculars  -0.2123     0.0799 -2.6557   0.0089

Matching

Matching creates two groups that are similar based on a set of covariates.

example from: https://theeffectbook.net/ch-Matching.html

Matching

Gender   Training   Job   No Job   Total
Man      Yes         60       20      80
Woman    Yes         12        8      20
Man      No         350      150     500
Woman    No         275      225     500
Total               697      403    1100

\((60 + 12)/100 = 0.72\) of people in the training group get jobs, and \((350 + 275)/1000 = 0.625\) of people in the control group get jobs. But…

Matching

Gender is a confounder of the effect of training on outcome.

💡 Create two groups where proportion of gender doesn’t vary

Naive Option: throw away data 🗑️ 😠

Matching

Training Group: 80% men, 20% women

Population: 50% men, 50% women

💡 Find weights for each observation that make training group and population comparable

Matching

Gender   Training   Job   No Job   Total
Man      Yes         60       20      80
Woman    Yes         12        8      20
Man      No         350      150     500
Woman    No         275      225     500
Total               697      403    1100

\(w = 1\) for those in training group, \(w_{man} = \frac{80}{500}\) and \(w_{woman} = \frac{20}{500}\) for control group

Matching

Now:

\[ \text{prop man}_{\text{no train}} = \frac{0.16 \times500}{0.16 \times500 + 0.04 \times 500} = 0.8 \]

\[ \text{prop man}_{\text{train}} = \frac{1 \times80}{1 \times80 + 1 \times 20} = 0.8 \]

💡we’ve re-weighted each sample so that men comprise 80% of the sample!

Matching

Originally \((60 + 12)/100 = 0.72\) of people in the training group get jobs, and \((350 + 275)/1000 = 0.625\) of people in the control group get jobs.

Now, \(\frac{(1 \times60 + 1\times12)}{(1 \times80 + 1\times 20)} = 0.72\) of people in the training group got jobs and \(\frac{(0.16 \times 350 + 0.04 \times 275)}{0.16 \times500 + 0.04 \times 500} = 0.67\) of people in the control group got jobs.


There’s less of an impact once we “control for” the backdoor path of gender.
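The weighted rates above can be reproduced directly from the table in a few lines of R (counts from the slides; `w` as defined two slides back):

```r
# rows: Man/Yes, Woman/Yes, Man/No, Woman/No
job    <- c(60, 12, 350, 275)
no_job <- c(20,  8, 150, 225)
total  <- job + no_job
train  <- c(1, 1, 0, 0)

# weights: 1 in the training group; 80/500 (men) and 20/500 (women) in control
w <- c(1, 1, 80/500, 20/500)

# weighted job rates
sum(w * job * train) / sum(w * total * train)              # 0.72
sum(w * job * (1 - train)) / sum(w * total * (1 - train))  # 0.67
```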

Matching

selecting matches vs. re-weighting the sample

We could also select members of the control group who match each member of the treatment group: 👯

Matching

Since A, B, and C are confounders of the effect of treat on outcome, we want to get groups that match on A, B, and C.

Matching

  • Distance Matching: observations are similar if they are “similar” in A, B, and C

  • Propensity Score Matching: observations are similar if they have a “similar” probability of being treated
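As a sketch of what both flavors look like in practice, here is a run with the `MatchIt` package (assumed available; the data are simulated here, and `A`, `B`, `C` are the hypothetical confounder names from the DAG):

```r
library(MatchIt)

# simulated stand-in data: A, B, C influence treatment assignment
set.seed(7)
n <- 500
A <- rnorm(n); B <- rnorm(n); C <- rnorm(n)
treat <- rbinom(n, 1, plogis(A + 0.5*B + 0.25*C))
df <- data.frame(treat, A, B, C)

# propensity score matching: pair units with similar P(treat | A, B, C)
m_ps <- matchit(treat ~ A + B + C, data = df, method = "nearest")

# distance matching: pair units with similar values of A, B, C themselves
m_dist <- matchit(treat ~ A + B + C, data = df,
                  method = "nearest", distance = "mahalanobis")

# matched data set for the outcome model
matched_df <- match.data(m_ps)
```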

Propensity Scores

In non-experimental studies, confounders can influence both exposure and outcome

But what if we could adjust for your probability of being treated?

Propensity Scores

propensity score: probability of being exposed given some set of covariates \(\mathbf{c}\)

Rosenbaum and Rubin (1983) suggest that controlling for propensity scores can give you an unbiased estimate of the effect of exposure on outcome if:

  • There are no unmeasured confounders

  • Every subject has a non-zero probability of getting exposed

from: https://www.r-causal.org/chapters/08-building-ps-models

Propensity Scores

Propensity scores allow us to “simulate” what would happen if each subject were treated/not treated. from: https://www.youtube.com/watch?v=PfLYPt9ur4g

Propensity Scores

In this example, \(Z\) impacts the probability that you’re treated (\(P(treat | z = 1)\) = 50% vs \(P(treat | z = 0)\) = 75%). from: https://www.youtube.com/watch?v=PfLYPt9ur4g

Propensity Score Generation

want: \(p\left (\text{exposure = 1} | \mathbf{c} \right)\)


that looks a lot like what a logistic regression of exposure ~ c estimates…

Propensity Score Generation

# from: https://www.hsph.harvard.edu/miguel-hernan/wp-content/uploads/sites/1268/2024/04/hernanrobins_WhatIf_26apr24.pdf
# qsmk -> wt82_71

nhefs <- read.csv("../../Data/nhefs.csv") %>%
  select('wt82_71', 'qsmk', 'sex', 'age', 'race', 'education', 'wt71', 'active', 'exercise', 'smokeintensity', 'smokeyrs')

nhefs %>% head(3)
     wt82_71 qsmk sex age race education  wt71 active exercise smokeintensity
1 -10.093960    0   0  42    1         1 79.04      0        2             30
2   2.604970    0   0  36    0         2 58.63      0        0             20
3   9.414486    0   1  56    1         2 56.81      0        2             20
  smokeyrs
1       29
2       24
3       26

Propensity Score Generation

How likely is it that someone will quit smoking?

lr <- glm(qsmk ~ sex + age + race + education + wt71 +
            active + exercise+ smokeintensity + smokeyrs,
          data = nhefs,
          family = "binomial")

prp_scr <- predict(lr, nhefs, type = "response")

nhefs <- nhefs %>% mutate(prp_scr = prp_scr)
hist(nhefs$prp_scr)

Inverse Probability Weighting

\[ w_A = \frac{1}{p(A|\mathbf{c})} = \frac{\text{treat}}{\text{propensity}} + \frac{1-\text{treat}}{1-\text{propensity}} \]

where \(A\) is the treatment (\(A = 0\) did not quit smoking, \(A=1\) did quit smoking). For people who did quit, the weights are \(w_{A = 1} = \frac{1}{p(A = 1 | \mathbf{c})}\) and for people who did not quit, the weights are \(w_{A = 0} = \frac{1}{p(A = 0 | \mathbf{c})}\).


Note:
the less likely it is that someone would be in their observed treatment group, the higher that subject’s weight.

Inverse Probability Weighting

 nhefs <- nhefs %>%
   mutate(ipw = (qsmk / prp_scr) + ((1 - qsmk) / (1 - prp_scr)))
 
 nhefs %>% 
   select(qsmk, prp_scr, ipw) %>%
   head(4)
  qsmk   prp_scr      ipw
1    0 0.1237615 1.141242
2    0 0.1896523 1.234038
3    0 0.1805699 1.220360
4    0 0.3176696 1.465566

Inverse Probability Weighting

biased_lm <- lm(wt82_71 ~ qsmk,
                  data = nhefs)
unbiased_lm <- lm(wt82_71 ~ qsmk,
                  data = nhefs,
                  weights = ipw)
# print out
summary(biased_lm)$coef
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 1.984498  0.2288285 8.672423 1.040008e-17
qsmk        2.540581  0.4510799 5.632220 2.105798e-08
summary(unbiased_lm)$coef
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 1.776923  0.2863833 6.204703 7.002251e-10
qsmk        3.322501  0.4077726 8.147926 7.498911e-16

Note: there’s a large amount of bias in the estimate if we don’t control for potential confounders!

Diff-in-Diff

Diff-in-Diff

Just kidding.

Diff-in-Diff

  • Assumptions: Parallel Trends

Diff-in-Diff

  • Checking Parallel Trends Assumption

Diff-in-Diff: Marketing Example

from: https://www.geteppo.com/eppo-geolift-for-data

Diff-in-Diff: Simple

\[ \underbrace{\left(After_t - Before_t\right)}_\text{diff for treatment} - \underbrace{\left(After_c - Before_c\right)}_\text{diff for control} = DiD \]

\[ \underbrace{\left(130.1 - 84.9\right)}_\text{diff for treatment} - \underbrace{\left(146.4 - 134.9\right)}_\text{diff for control} = \underbrace{33.7}_\text{diff-in-diff} \]

Diff-in-Diff: Regression

\[ \hat{y} = \beta_0 + \beta_1\text{treat} + \beta_2\text{time} + \beta_3\text{treat}\times\text{time} \]

  • \(\beta_0\) : the control group’s mean at \(\text{time} = 0\)

  • \(\beta_0 + \beta_1\): the treatment group’s mean at \(\text{time} = 0\)

  • \(\beta_0 + \beta_2\): the control group’s mean at \(\text{time} = 1\)

  • \(\beta_0 + \beta_1 + \beta_2 + \beta_3\): the treatment group’s mean at \(\text{time} = 1\)

Note: The regression estimates the 4 needed means, and gives you standard errors to quantify uncertainty.
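A quick simulation (effect sizes made up for illustration) confirms that the interaction coefficient is the diff-in-diff estimate:

```r
# sketch: 2x2 diff-in-diff where the true treatment effect is 3
set.seed(123)
n <- 1000
treat <- rbinom(n, 1, 0.5)
time  <- rbinom(n, 1, 0.5)

# parallel trends by construction: time shifts both groups by 5
y <- 10 + 2*treat + 5*time + 3*treat*time + rnorm(n)

did <- lm(y ~ treat*time)
coef(did)["treat:time"]  # close to the true effect of 3
```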

Regression Discontinuity

n <- 1000
income_in_k <- rnorm(n, 70, 20)
medicaid <- income_in_k <= 69   # cutoff: $69k
health_quality <- -15 + 0.2*income_in_k + 2*medicaid + rnorm(n, 0, 2)
df <- data.frame(income_in_k, medicaid, health_quality)

Regression Discontinuity

  • Running Variable: determines whether you’re above or below cutoff

  • Cutoff: threshold for getting treatment

  • Bandwidth: how wide around the cutoff to look
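A minimal sketch of applying a bandwidth, using the same simulated Medicaid setup (the seed and the bandwidth of 10 are arbitrary choices, not from the slides):

```r
library(dplyr)

# re-simulate the running variable, treatment, and outcome
set.seed(1)
n <- 1000
income_in_k <- rnorm(n, 70, 20)
medicaid <- income_in_k <= 69
health_quality <- -15 + 0.2*income_in_k + 2*medicaid + rnorm(n, 0, 2)
df <- data.frame(income_in_k, medicaid, health_quality)

# keep only observations within the bandwidth of the cutoff
bw <- 10
df_local <- df %>%
  filter(abs(income_in_k - 69) <= bw) %>%
  mutate(r_minus_c = income_in_k - 69)

rd_local <- lm(health_quality ~ r_minus_c*medicaid, data = df_local)
coef(rd_local)["medicaidTRUE"]  # close to the true effect of 2
```

Narrowing the bandwidth trades bias for variance: the local sample looks more like a randomized comparison at the cutoff, but there are fewer observations to estimate with.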

Regression Discontinuity

\[ \hat{y} = \beta_0 + \beta_1 \text{(running - cutoff)} + \beta_2 \text{treat} + \beta_3 \text{(running - cutoff)} \times \text{treat} \]

  • \(\beta_0 + \beta_2\): predicted value at the cutoff for treated subjects

  • \(\beta_0\) : predicted value at the cutoff for untreated subjects

Regression Discontinuity

df <- df %>%
  mutate(r_minus_c = income_in_k - 69)
rd <- lm(health_quality ~ r_minus_c + medicaid + r_minus_c:medicaid,
         data = df)
summary(rd)$coef
                           Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)            -1.276497576 0.148072135 -8.6207819  2.582548e-17
r_minus_c               0.202532345 0.007623427 26.5670984 5.401900e-118
medicaidTRUE            2.229709121 0.205185078 10.8668191  4.505584e-26
r_minus_c:medicaidTRUE  0.005996821 0.010575487  0.5670492  5.708085e-01

You’re Screwed

🤷‍♀️ Sometimes you can’t get an unbiased causal estimate. Causal Inference can at least help you better identify when that’s the case.

Further Topics in CI

  • Front Door Adjustments

  • G-Computation

  • Instrumental Variables

  • Potential Outcomes Framework